ABSTRACT

HbVar (http://globin.bx.psu.edu/hbvar) is one of the 
oldest and most appreciated locus-specific data- 
bases launched in 2001 by a multi-center 
academic effort to provide timely information on 
the genomic alterations leading to hemoglobin 
variants and all types of thalassemia and 
hemoglobinopathies. Database records include 
extensive phenotypic descriptions, biochemical 
and hematological effects, associated pathology 
and ethnic occurrence, accompanied by mutation 
frequencies and references. Here, we report 
updates to >600 HbVar entries, inclusion of
population-specific data for 28 populations and 27 ethnic
groups for α- and β-thalassemias and additional
querying options in the HbVar query page. HbVar 
content was also inter-connected with two other 
established genetic databases, namely FINDbase 
(http://www.findbase.org) and Leiden Open-Access 
Variation database (http://www.lovd.nl), which 
allows comparative data querying and analysis. 
HbVar data content has contributed to the realiza- 
tion of two collaborative projects to identify 
genomic variants that lie on different globin
paralogs. Most importantly, HbVar data content
has helped to demonstrate the microattribution
concept in practice. These updates significantly
enriched the database content and querying poten- 
tial, enhanced the database profile and data quality 
and broadened the inter-relation of HbVar with other 
databases, which should increase the already high 
impact of this resource on the globin and genetic
database community. 

INTRODUCTION 

Hemoglobinopathies are the commonest single-gene
genetic disorders in humans, resulting from pathogenic
genome variants in the human α-like and β-like globin
gene clusters (reviewed in 1). The human α-globin gene
cluster is composed of the HBZ (OMIM number
142310), HBA2 (OMIM number 141850), HBA1
(OMIM number 141800), HBM (OMIM number
609639) and HBQ1 (OMIM number 142240) genes,
which encode the ζ-, α2-, α1- and possibly μ- and
θ-globin polypeptides, respectively. The human β-globin
gene cluster is composed of the HBE1 (OMIM number
142100), HBG2 (OMIM number 142250), HBG1
(OMIM number 142200), HBD (OMIM number 142000)
and HBB (OMIM number 141900) genes, which encode
the ε-, Gγ-, Aγ-, δ- and β-globin polypeptides, respectively.
Single nucleotide substitutions or indels can lead to several
hemoglobin variants owing to amino acid replacements,
while molecular defects in either regulatory or coding
regions of the human HBA2, HBA1, HBB or HBD
genes can minimally or drastically reduce their expression,
leading to α-, β- or δ-thalassemia, respectively.

The HbVar database of hemoglobin variants and
thalassemia mutations is one of the oldest and most
appreciated locus-specific databases (LSDBs), not only
in the globin community but also in the wider genetic database
community. HbVar was launched in 2001 and derived
from previous compilations (2,3), as a publicly available 
LSDB, to provide timely information to interested users, 
e.g. the globin research community, patients and their 
parents and providers of genetic services and counseling. 
HbVar is designed to accommodate the need for
regular data entry updates and corrections, as new
hemoglobin variants and thalassemias continue to be
discovered. In addition, HbVar has a comprehensive query
interface that allows easy access to this information, par- 
ticularly for the research community and also for phys- 
icians as an aid in diagnosis. Since its launch, HbVar has 
rapidly become an important data resource for the globin 
research community and is considered to be one of the 
premier LSDBs available to date (4). We report here 
several new updates in HbVar structure and contents, 
aiming at increasing the quality of the database, the 
accuracy and breadth of data coverage and, above all, 
its impact on the scientific community. Also, the various
synergies with other data resources, namely LSDBs and 
National/Ethnic Genetic databases, are discussed. 

UPDATES TO EXISTING DATA 

Since the launch of HbVar (5) and the previous database 
updates in 2004 (6) and 2007 (7), HbVar information has 
been expanded by >600 additional entries and data cor- 
rections, made continually by the database curators. Also, 
to cope with the increased need for regular data updates
and corrections, Dr Joseph Borg (University of Malta, 
MT), Dr Kamran Moradkhani (Nantes, FR) and Dr 
Philippe Joly (Lyon, FR) have joined the HbVar team 
as data curators for thalassemia mutations and
hemoglobin variants, respectively. To identify new hemoglobin
variants and thalassemia mutations not previously
documented in the database, we continued to manually scan
articles from the specialized journal 'Hemoglobin', which
frequently publishes new hemoglobin variants and
thalassemia mutations, and where applicable, previously
undocumented variants have been entered into HbVar.

QUERY PAGE UPGRADES AND NEW 
FUNCTIONALITIES 

The HbVar query page underwent a major refit in
2006 (7) to improve the clarity of display. We have now
added two querying options to the database that
allow the user to query for the most recent updates,
referring to either new entries or updates of existing
entries (or both; Figure 1). The user can also specify the
date of the new or updated entry in the adjacent
drop-down menu. In addition, we have included query
options to list all HbVar entries that are also listed in
other resources, such as dbSNP
(http://www.ncbi.nlm.nih.gov/projects/SNP), Swiss-Prot (8)
and OMIM (9; Figure 1).

[Figure 1 screenshot panels: 'Query for SNPs at the human globin loci
using UCSC Browsers' (Human Feb. 2009 assembly, GRCh37/hg19); 'Recent
additions or updates to the database'; 'HbVar entries that have
outside links'.]

Figure 1. The additions in the updated HbVar query page. Hyperlinks
allowing the user to query for genomic variants using the UCSC and
PSU Genome Browsers (see also Figure 2), recent additions and/or
updates of the HbVar data content and HbVar entries that have
links to other databases, such as dbSNP, Swiss-Prot and OMIM (see
text for details).
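
The 'recent additions or updates' options can be pictured as a simple
date filter. The sketch below is illustrative only: the Entry record,
its field names and the sample entries are assumptions made for this
example and do not reflect the actual HbVar schema or query code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical record; field names are not taken from HbVar."""
    name: str
    created: date
    updated: Optional[date] = None  # None if the entry was never revised


def recent_entries(entries: List[Entry], since: date,
                   mode: str = "either") -> List[Entry]:
    """Return entries added and/or revised on or after `since`.

    mode: "new" (new entries only), "updated" (older entries that were
    revised) or "either" (new or updated entries).
    """
    if mode not in ("new", "updated", "either"):
        raise ValueError(f"unknown mode: {mode}")
    selected = []
    for entry in entries:
        is_new = entry.created >= since
        # An entry counts as "updated" only if it existed before `since`.
        is_updated = (not is_new and entry.updated is not None
                      and entry.updated >= since)
        if mode == "new":
            keep = is_new
        elif mode == "updated":
            keep = is_updated
        else:
            keep = is_new or is_updated
        if keep:
            selected.append(entry)
    return selected


# Made-up sample data, for illustration only.
sample = [
    Entry("entry-1", created=date(2005, 3, 1)),
    Entry("entry-2", created=date(2006, 7, 15), updated=date(2012, 2, 10)),
    Entry("entry-3", created=date(2012, 5, 20)),
]
print([e.name for e in recent_entries(sample, date(2012, 1, 1), "either")])
```

Treating 'updated' as excluding entries created in the same period is a
design choice of this sketch; it keeps the three modes non-overlapping
in the same way as the three radio buttons of the query page.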

Beyond these changes to the querying options,
we have added new data visualization features. As well
as querying HbVar contents by gene location
and globin chain, the user can also pre-select the output
of the query results in different genome browsers, namely
the University of California at Santa Cruz (UCSC)
Genome Browser (10) and the Pennsylvania State
University (PSU) Genome Browser, using the dedicated
HbVar custom tracks (Figure 2). All these upgrades
have significantly enriched the content of the data
output of all HbVar records, as shown in Supplementary
Figure S1.
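
Custom tracks for the UCSC Genome Browser are commonly supplied as
plain text, for example in BED format. The following is a minimal
sketch of how variant positions could be written as such a track. It
is not the code behind the HbVar tracks, and the function name, the
variant labels and the coordinates are placeholders invented for
illustration.

```python
from typing import Iterable, List, Tuple

# (chromosome, 1-based start position, length in bases, label)
Variant = Tuple[str, int, int, str]


def to_bed_track(variants: Iterable[Variant], track_name: str) -> str:
    """Render variants as a BED custom track (0-based, half-open)."""
    lines: List[str] = [f'track name="{track_name}" visibility=pack']
    for chrom, pos, length, label in variants:
        start = pos - 1        # BED start coordinates are 0-based
        end = start + length   # BED end coordinates are exclusive
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines)


# Placeholder coordinates, not real HbVar positions.
example = [
    ("chr11", 1000, 1, "substitution_A"),
    ("chr11", 2000, 4, "deletion_B"),
]
print(to_bed_track(example, "example_variants"))
```

The only non-obvious step is the coordinate conversion: positions that
are reported 1-based in variant descriptions are shifted to the 0-based,
end-exclusive intervals that BED uses.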



INTER-RELATION WITH OTHER DATABASES AND 
SCIENTIFIC JOURNALS 

To maximize querying options and comparative data 
analysis, we have decided to share HbVar data content 
with that of other established and internationally 
renowned genetic data resources. 

FINDbase (http://www.findbase.org; 11) is a global 
database documenting allele frequencies of clinically 
relevant genomic variations, namely, causative mutations 
and pharmacogenomic markers, in various populations 
and ethnic groups worldwide. HbVar allele frequency 
data were initially shared with FINDbase developers in 
2006 and made available as an integral part of its data 
collection. Since then, data collection has been
significantly updated with new α- and β-thalassemia mutation
frequency data from 28 populations and 27 ethnic
groups and/or geographical regions that were extracted
from the published literature (Figure 3).

Figure 2. Overview of the new HbVar visual display feature that couples with the PSU Genome Browser. A user can select a particular gene or
genomic region in the human α- or β-globin cluster in the HbVar query page (HBB in this example) and the query returns all genomic variants (base
changes, small indels and gross deletions) documented in HbVar that are present within the selected region. Note that the majority of single base
changes are clustered within the HBB gene exons. Similar displays can be obtained with the UCSC Genome Browser using the HbVar
custom tracks (not shown). The PSU Genome Browser toolbar is shown on top.



Also, the entire HbVar database content has been
copied to the Leiden Open-Access Variation database
(LOVD; http://www.lovd.nl; 12) V2.0 platform, not only
to fulfill the need for comparative analyses with data from
other LSDBs that have been developed on the same
platform but, most importantly, to allow documentation
of hemoglobinopathy-related genome variations in other
genomic loci, namely, those leading to α-thalassemia,
variations in genes modifying thalassemia disease severity,
quantitative trait loci, etc. (see later in the text).

Finally, the HbVar database is inter-related with the
scientific journal Hemoglobin such that each variant published
in the journal should first be documented in the HbVar
database.

Figure 3. HBB allele frequency data visualization in the FINDbase database. (A) The present query involves all HBB gene variants with an allele
frequency between 19.96 and 30.59%. The query returns 30 records sorted by frequency. (B) The user can zoom in to each record (red box) for details
[this record involves the HBB:c.93-21G>A allele frequency in the Bulgarian population (24.75%), documented in the data entry displayed in the list
on the right (see also 11)].
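
The kind of range query shown in Figure 3 can be approximated in a few
lines. This is a hedged sketch, not FINDbase code: the FrequencyRecord
type and the function name are hypothetical, the descending sort order
is an assumption, and all sample records except the Bulgarian
HBB:c.93-21G>A entry quoted in the caption are invented placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrequencyRecord:
    """Hypothetical allele frequency record (not the FINDbase schema)."""
    gene: str
    variant: str
    population: str
    frequency_pct: float


def query_by_frequency(records: List[FrequencyRecord], gene: str,
                       low_pct: float, high_pct: float) -> List[FrequencyRecord]:
    """Select records for `gene` with low_pct <= frequency <= high_pct,
    sorted from the highest to the lowest frequency."""
    hits = [r for r in records
            if r.gene == gene and low_pct <= r.frequency_pct <= high_pct]
    return sorted(hits, key=lambda r: r.frequency_pct, reverse=True)


records = [
    # Taken from the Figure 3 caption.
    FrequencyRecord("HBB", "HBB:c.93-21G>A", "Bulgarian", 24.75),
    # Invented placeholders, for illustration only.
    FrequencyRecord("HBB", "variant-X", "population-1", 30.10),
    FrequencyRecord("HBB", "variant-Y", "population-2", 5.00),
    FrequencyRecord("HBA2", "variant-Z", "population-3", 25.00),
]

for hit in query_by_frequency(records, "HBB", 19.96, 30.59):
    print(hit.population, hit.variant, hit.frequency_pct)
```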



IMPLEMENTING MICROATTRIBUTION TO 
ENCOURAGE DATA SUBMISSION 

In 2008, the scientific journal Nature Genetics introduced
the concept of microattribution in an attempt to establish
an alternative reward system for scientific contributions,
such as database entries and records and possibly database
curation efforts. The principle of microattribution is '... to
produce a publication workflow that is open to all journals
and that draws on the expertise of all those with a stake in
understanding variation at a particular region in the
human genome' (13). In this way, microattribution
introduces the concept of curating individual genome variants,
so that data submitters and database curators obtain all
due credit for their effort and/or contribution. A
prerequisite for this is that genomic variants should be
deposited in stable, publicly available and well-maintained
central repositories that would run independent
microattribution services, based on an individual
researcher's unique identity, defined by the Open
Researcher and Contributor ID consortium
(http://orcid.org), ResearcherID (http://www.researcherid.com),
OpenID (http://openid.net) and so on (14).

HbVar data content was used to first demonstrate the
use of microattribution in practice (15). In particular, all
causative globin gene variations in the α-like and β-like
globin genes, as well as those genomic loci that when
mutated lead to (ATRX, KLF1, BCL11A, etc.) or modify
a hemoglobinopathy-related phenotype (MYB, MAP3K5,
PDE7B, etc.), their associated phenotypes and allele
frequencies, where applicable, were comprehensively
documented in the 37 LOVD-based interrelated LSDBs
(Table 1) and assigned a unique LSDB accession
number and IDs of the contributor(s) of the variant.

Table 1. Total number of variants deposited into the LOVD-related
LSDBs (http://lovd.bx.psu.edu). The first part of the table
corresponds to the globin genes and the second part to genes not linked
with the human α- and β-globin gene clusters but related to (or
modifying) the resulting hemoglobinopathy phenotype and clinical
picture

Gene       Chromosome         Genomic RefSeq ID   Number of variants

Globin genes
HBA1       16p13.3            NG_000002.1         320
HBA2       16p13.3            NG_000006.1         380
HBB        11p15.4            NG_000007.3         826
HBD        11p15.4            NG_000007.3         91
HBG1       11p15.4            NG_000007.3         51
HBG2       11p15.4            NG_000007.3         65

Genes not linked with the human globin gene clusters
ALOX5AP    13q12              NC_000013.10        3
AQP9       15q22.1-q22.2      NC_000015.9         1
ARG2       14q24.1-q24.3      NC_000014.5         2
ASS1       9q34.1             NC_000009.11        4
ATRX       Xq13.1-q21.1       NC_000023.10        161
BCL11A     2p16.1             NC_000002.11        9
CNTNAP2    7q35-q36           NC_000002.11        1
CSNK2A1    20p13              NC_000020.10        1
EPAS1      2p21-p16           NG_016000.1         1
ERCC2      19q13.3            NC_000019.9         6
FTL1       13q12              NC_000013.10        12
GATA1      Xp11.23            NC_000023.10        1
GPM6B      Xp22.2             NC_000023.10        7
HAO2       1p13.3-p13.1       NC_000001.10        1
HBS1L      6q23.3             NC_000006.11        12
KDR        4q11-q12           NC_000004.11        4
KL         13q12              NC_000012.10        5
KLF1       19p13.13-p13.12    NC_000019.9         27
MAP2K1     15q22.1-q22.33     NG_008305.1         1
MAP3K5     6q22.33            NG_011965.1         2
MAP3K7     6q16.1-q16.3       NC_000006.11        2
MYB        6q23.3             NC_000006.11        12
NOS1       12q24.2-q24.31     NC_000012.11        6
NOS2       17q11.2-q12        NC_000017.10        2
NOS3       7q36               NC_000007.13        3
NOX3       6q25.1-q26         NC_000006.11        4
NUP133     1q42.13            NC_000001.10        1
PDE7B      6q23-q24           NC_000006.11        4
SMAD3      15q22.33           NC_000015.9         2
SMAD6      15q21-q22          NC_000015.9         4
TOX        8q12.1             NC_000008.10        21


The genomic variants in these LSDBs included either
published variants stored against their PubMed IDs or
unpublished variants contributed by individual
researchers or research groups involved in hemoglobin
research. Subsequently, the microattribution tables were
deposited in the NCBI (http://www.ncbi.nlm.nih.gov)
public repository in an effort to measure microcitations
for every data contributor or data unit centrally. The
microattribution article itself, comprising 51 authors
from 35 institutions, was published in 'Nature Genetics'
in 2011 (15).
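
To make the bookkeeping concrete, the sketch below shows one way a
variant record carrying an accession number, contributor IDs and an
optional PubMed ID could be tallied per contributor. It is an
assumption-laden illustration, not the LOVD or NCBI implementation:
the VariantRecord type, the accession strings and the contributor IDs
are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class VariantRecord:
    """Hypothetical microattribution record (not the LOVD schema)."""
    accession: str                       # unique LSDB accession number
    gene: str
    contributor_ids: List[str] = field(default_factory=list)
    pubmed_id: Optional[str] = None      # None for unpublished variants


def count_contributions(records: Iterable[VariantRecord]) -> Counter:
    """Count how many variant records each contributor is credited on."""
    counts: Counter = Counter()
    for record in records:
        # Credit each contributor at most once per variant record.
        counts.update(set(record.contributor_ids))
    return counts


# Invented records, for illustration only.
records = [
    VariantRecord("ACC-0001", "HBB", ["contributor-A", "contributor-B"]),
    VariantRecord("ACC-0002", "KLF1", ["contributor-A"]),
    VariantRecord("ACC-0003", "ATRX", ["contributor-C", "contributor-C"]),
]

for contributor, n in sorted(count_contributions(records).items()):
    print(contributor, n)
```

Keying the tally on a stable contributor identifier rather than on a
name is what allows credit to be accumulated across databases.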

According to HbVar data access and contribution records,
microattribution has contributed significantly
to the data submission rate, which showed
an up to 8.2-fold increase
compared with previous years in which HbVar was
active (15), even compared with the official HbVar
launch year (2001), emphasizing the potential impact of
microattribution on genome variation data submission.
Also, apart from the increase in data contribution to
HbVar, a number of useful conclusions were drawn
from this approach, mostly derived from the clustering
of HBG2/HBG1 gene promoter variants and the ATRX
and KLF1 gene coding variants, from which new
mutation patterns have emerged (15). Such conclusions
would not have been possible without such an approach,
further demonstrating the value of the immediate
contribution of novel genome variants to databases even though
they would not warrant classical narrative publication
on their own.



DATA CONTRIBUTION IN COLLABORATIVE 
PROJECTS 

As previously shown with microattribution, the existence
of comprehensive data repositories allows comparative
data analysis and the drawing of conclusions that
would otherwise not have been possible. HbVar data
content was exploited in two such collaborative projects
to study, from an evolutionary and functional perspective:
(i) hemoglobin variants that may be due to the same
mutation but lie on a different α-globin (HBA1 or
HBA2) gene and (ii) hemoglobin variants and mutations
leading to hereditary persistence of fetal hemoglobin that
result from the same mutation but on either the HBG1 or
HBG2 (fetal globin) gene.

In the first case, the study was performed within the
context of the European Commission-funded ITHANET
collaborative project (http://www.ithanet.eu), in which we
were able to identify 14 different hemoglobin variants
resulting from identical mutations on either one of the
two human α-globin (HBA1 or HBA2) paralogous genes
(16). In the second project, we identified
11 different hemoglobin variants resulting from identical
mutations on either one of the two human γ-globin
paralogous genes, while seven other promoter variants either
result in non-deletional hereditary persistence of fetal
hemoglobin or are benign polymorphisms (17).
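
Finding identical mutations on two paralogs amounts to grouping
variant descriptions by the change itself and keeping those reported
in both genes. The sketch below illustrates the idea under a stated
assumption: that each change is written in a notation that is directly
comparable between the two paralogs. The function name and the variant
strings are placeholders, not HbVar data or the method used in (16,17).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def shared_between_paralogs(variants: Iterable[Tuple[str, str]],
                            gene_a: str, gene_b: str) -> List[str]:
    """Return changes reported on both gene_a and gene_b.

    variants: (gene, change) pairs, where `change` is assumed to be
    written identically whenever the same mutation occurs in either
    paralog.
    """
    genes_by_change: Dict[str, Set[str]] = defaultdict(set)
    for gene, change in variants:
        genes_by_change[change].add(gene)
    return sorted(change for change, genes in genes_by_change.items()
                  if {gene_a, gene_b} <= genes)


# Placeholder variant descriptions, for illustration only.
variants = [
    ("HBA1", "change-1"),
    ("HBA2", "change-1"),
    ("HBA1", "change-2"),
    ("HBA2", "change-3"),
]
print(shared_between_paralogs(variants, "HBA1", "HBA2"))
```

The same function applies unchanged to the HBG1/HBG2 pair, since it
makes no assumption about the genes beyond their names.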

PhenCode: CONNECTING GENOTYPE WITH 
PHENOTYPE 

LSDBs are key connections between the abundance of
genomic information and clinically important issues. In
2006, the HbVar and GenPhen databases were adapted
to complete a path from genome sequence to functional
analysis, via the Encyclopedia of DNA Elements (ENCODE,
http://www.genome.gov/10005107; 18), to human
mutations (HbVar) and to clinical phenotypes of groups of
patients (GenPhen). The display of clinically relevant
locus-specific mutation data on the UCSC Genome
Browser makes it readily available to a wide audience,
and facilitates viewing of data from many sources in one
context. On the other hand, links back to the original
databases allow detailed queries within individual loci, which
can then be combined and further analyzed accordingly.
This combination of LSDBs and powerful genome
browsers paves the way for drawing useful
genotype-phenotype connections, which can be expanded to other
loci of clinical importance, such as modifier genes.
Examples of HbVar-related use of PhenCode are
available at http://globin.bx.psu.edu/phencode/examples (see
also 19).

DATABASE ACCESS 

Since their launch in January 2001, the HbVar database
and associated resources at the Globin Gene Server
(http://globin.bx.psu.edu), such as the online Syllabi, have been
regularly used worldwide. Also, HbVar is very frequently
accessed via Facebook and mobile devices. Users
frequently contact the curators and the rest of the HbVar
team members to submit new hemoglobin variants and/or
thalassemia mutations, report missing information for
existing mutants, identify inconsistencies and/or erroneous
entries and even propose collaborative projects related to
HbVar data records.

This not only shows how important users' input is
to improving data quality and accuracy but also
demonstrates the impact that HbVar has on the entire globin
research community.

FUTURE PROSPECTS 

HbVar has become, since the year of its launch, a key
resource for information about sequence variation
leading to hemoglobinopathies and is still considered an
exemplary LSDB among the various existing ones. Key
factors that have contributed to its success are (i) its
constant data updates and improvements, mostly driven
by the devotion of the researchers involved in this
project, (ii) its dynamic data querying and visualization
tools, in conjunction with the UCSC and PSU Genome
Browsers and (iii) its partnership with other stable and
well-respected international databases. The positive
impact that HbVar has on the research community is
also illustrated by the fact that funding, dedicated or
related to other projects, has always been available for
keeping this resource alive, in an environment where
dedicated funding opportunities for database development
and curation are often limited and very hard to find (20),
frequently resulting in the discontinuation of many useful
databases. The international recognition of HbVar
also comes from the fact that although a couple of other
globin-related databases, such as Deniz [content
migrated to the Catalog of Transmission Genetics in
Arabs (CTGA) database
(http://www.cags.org.ae/ctga_search.html), 21] or the Ithanet
databases (http://www.ithanet.eu/index.php/db), have been
developed, HbVar is still considered the key LSDB for
hemoglobin research professionals.

To ensure continuous HbVar data enrichment, we plan
to implement a broader search strategy that combines
manual and electronic search procedures and also tighter
links to the scientific journal 'Hemoglobin'. Also, we plan
to expand the inter-relation of HbVar with other relevant
high-quality databases, following the successful example
of HbVar and GALA databases (6), the UCSC and PSU
Genome Browsers, PhenCode (19), FINDbase (11) and
LOVD (12).



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 